From bb907155585ca3bf794cf4d41171ad810c207304 Mon Sep 17 00:00:00 2001
From: "akw27@labyrinth.cl.cam.ac.uk" <akw27@labyrinth.cl.cam.ac.uk>
Date: Fri, 27 Aug 2004 14:59:26 +0000
Subject: [PATCH] bitkeeper revision 1.1159.58.1
 (412f4c4egZceX9qbmExr-wa_i_VDWw)

Notes on the inner workings of the blkif drivers.
---
 .rootkeys                        |   1 +
 docs/blkif-drivers-explained.txt | 477 +++++++++++++++++++++++++++++++
 2 files changed, 478 insertions(+)
 create mode 100644 docs/blkif-drivers-explained.txt

diff --git a/.rootkeys b/.rootkeys
index bbf5c23855..3498de4277 100644
--- a/.rootkeys
+++ b/.rootkeys
@@ -12,6 +12,7 @@
 4021053fmeFrEyPHcT8JFiDpLNgtHQ docs/HOWTOs/Xen-HOWTO
 4022a73cgxX1ryj1HgS-IwwB6NUi2A docs/HOWTOs/XenDebugger-HOWTO
 3f9e7d53iC47UnlfORp9iC1vai6kWw docs/Makefile
+412f4bd9sm5mCQ8BkrgKcAKZGadq7Q docs/blkif-drivers-explained.txt
 3f9e7d60PWZJeVh5xdnk0nLUdxlqEA docs/eps/xenlogo.eps
 3f9e7d63lTwQbp2fnx7yY93epWS-eQ docs/figs/dummy
 3f9e7d564bWFB-Czjv1qdmE6o0GqNg docs/interface.tex
diff --git a/docs/blkif-drivers-explained.txt b/docs/blkif-drivers-explained.txt
new file mode 100644
index 0000000000..8f6f7a498a
--- /dev/null
+++ b/docs/blkif-drivers-explained.txt
@@ -0,0 +1,477 @@
+=== How the Blkif Drivers Work ===
+Andrew Warfield
+andrew.warfield@cl.cam.ac.uk
+
+The intent of this is to explain at a fairly detailed level how the
+split device drivers work in Xen 1.3 (aka 2.0beta).  The intended
+audience for this, I suppose, is anyone who intends to work with the
+existing blkif interfaces and wants something to help them get up to
+speed with the code in a hurry.  Secondly though, I hope to break out
+the general mechanisms that are used in the drivers that are likely to
+be necessary to implement other driver interfaces.
+
+As a point of warning before starting, it is worth mentioning that I
+anticipate much of the specifics described here changing in the near
+future.  There has been talk about making the blkif protocol
+a bit more efficient than it currently is.  Keir's addition of grant
+tables will change the current remapping code that is used when shared
+pages are initially set up.
+
+Also, writing other control interface types will likely need support
+from Xend, which at the moment has a steep learning curve... this
+should be addressed in the future.
+
+For more information on the driver model as a whole, read the
+"Reconstructing I/O" technical report
+(http://www.cl.cam.ac.uk/Research/SRG/netos/papers/2004-xenngio.pdf).
+
+==== High-level structure of a split-driver interface ====
+
+Why would you want to write a split driver in the first place?  As Xen
+is a virtual machine manager and focuses on isolation as an initial
+design principle, it is generally considered unwise to share physical
+access to devices across domains.  The reasons for this are obvious:
+when device resources are shared, misbehaving code or hardware can
+result in the failure of all of the client applications.  Moreover, as
+virtual machines in Xen are entire OSs, standard device drivers that
+they might use cannot have multiple instantiations for a single piece
+of hardware.  In light of all this, the general approach in Xen is to
+give a single virtual machine hardware access to a device, and where
+other VMs want to share the device, export a higher-level interface to
+facilitate that sharing.  If you don't want to share, that's fine.
+There are currently Xen users actively exploring running two
+completely isolated X-Servers on a Xen host, each with its own video
+card, keyboard, and mouse.  In these situations, the guests need only
+be given physical access to the necessary devices and left to go on
+their own.  However, for devices such as disks and network interfaces,
+where sharing is required, the split driver approach is a good
+solution.
+
+The structure is like this:
+
+   +--------------------------+  +--------------------------+
+   | Domain 0 (privileged)    |  | Domain 1 (unprivileged)  |
+   |                          |  |                          |
+   | Xend ( Application )     |  |                          |
+   | Blkif Backend Driver     |  | Blkif Frontend Driver    |
+   | Physical Device Driver   |  |                          |
+   +--------------------------+  +--------------------------+
+   +--------------------------------------------------------+
+   |                X       E       N                       |
+   +--------------------------------------------------------+
+
+
+The Blkif driver is in two parts, which we refer to as a frontend (FE)
+and a backend (BE).  Together, they serve to proxy device requests
+between the guest operating system in an unprivileged domain, and the
+physical device driver in the privileged domain.  An additional benefit
+to this approach is that the FE driver can provide a single interface
+for a whole class of physical devices.  The blkif interface mounts
+IDE, SCSI, and our own VBD-structured disks, independent of the
+physical driver underneath.  Moreover, supporting additional OSs only
+requires that a new FE driver be written to connect to the existing
+backend.
+
+==== Inter-Domain Communication Mechanisms ====
+
+===== Event Channels =====
+
+Before getting into the specifics of the block interface driver, it is
+worth discussing the mechanisms that are used to communicate between
+domains.  Two mechanisms are used to allow the construction of
+high-performance drivers: event channels and shared-memory rings.
+
+Event channels are an asynchronous interdomain notification
+mechanism.  Xen allows channels to be instantiated between two
+domains, and domains can request that a virtual irq be attached to
+notifications on a given channel.  The result of this is that the
+frontend domain can send a notification on an event channel, resulting
+in an interrupt entry into the backend at a later time.
+
+The event channel between two domains is instantiated in the Xend code
+during driver startup (described later).  Xend's channel.py
+(tools/python/xen/xend/server/channel.py) defines the function
+
+
+def eventChannel(dom1, dom2):
+    return xc.evtchn_bind_interdomain(dom1=dom1, dom2=dom2)
+
+
+which maps to xc_evtchn_bind_interdomain() in tools/libxc/xc_evtchn.c,
+which in turn generates a hypercall to Xen to patch the event channel
+between the domains.  Only a privileged domain can request the
+creation of an event channel.
+
+Once the event channel is created in Xend, its ends are passed to both the
+front and backend domains over the control channel.  The end that is
+passed to a domain is just an integer "port" uniquely identifying the
+event channel's local connection to that domain.  An example of this
+setup code is in linux-2.6.x/drivers/xen/blkfront/blkfront.c in
+blkif_status_change, which receives several status change events as
+the driver starts up.  It is passed an event channel end in a
+BLKIF_INTERFACE_STATUS_CONNECTED message, and patches it in like this:
+
+
+   blkif_evtchn = status->evtchn;
+   blkif_irq    = bind_evtchn_to_irq(blkif_evtchn);
+   if ( (rc = request_irq(blkif_irq, blkif_int, 
+                          SA_SAMPLE_RANDOM, "blkif", NULL)) )
+       printk(KERN_ALERT"blkfront request_irq failed (%ld)\n",rc);
+
+
+This code associates a virtual irq with the event channel, and
+attaches the function blkif_int() as an interrupt handler for that
+irq.  blkif_int() simply handles the notification and returns; it does
+not need to interact with the channel at all.
+
+An example of generating a notification can also be seen in blkfront.c:
+
+
+static inline void flush_requests(void)
+{
+    DISABLE_SCATTERGATHER();
+    wmb(); /* Ensure that the backend can see the requests. */
+    blk_ring->req_prod = req_prod;
+    notify_via_evtchn(blkif_evtchn);
+}
+
+
+notify_via_evtchn issues a hypercall to set the event waiting flag on
+the other domain's end of the channel.
+
+===== Communication Rings =====
+
+Event channels are strictly a notification mechanism between domains.
+To move large chunks of data back and forth, Xen allows domains to
+share pages of memory.  We use communication rings as a means of
+managing access to a shared memory page for message passing between
+domains.  These rings are not explicitly a mechanism of Xen, which is
+only concerned with the actual sharing of the page and not how it is
+used.  They are, however, worth discussing as they are used in many
+places in the current code and are a useful model for communicating
+across a shared page.
+
+A shared page is set up by a guest first allocating and passing the
+address of a page in its own address space to the backend driver.  
+
+
+   blk_ring = (blkif_ring_t *)__get_free_page(GFP_KERNEL);
+   blk_ring->req_prod = blk_ring->resp_prod = resp_cons = req_prod = 0;
+   ...
+   /* Construct an interface-CONNECT message for the domain controller. */
+   cmsg.type      = CMSG_BLKIF_FE;
+   cmsg.subtype   = CMSG_BLKIF_FE_INTERFACE_CONNECT;
+   cmsg.length    = sizeof(blkif_fe_interface_connect_t);
+   up.handle      = 0;
+   up.shmem_frame = virt_to_machine(blk_ring) >> PAGE_SHIFT;
+   memcpy(cmsg.msg, &up, sizeof(up));  
+
+
+blk_ring will be the shared page.  The producer and consumer pointers
+are then initialised (these will be discussed soon), and then the
+machine address of the page is sent to the backend via a control
+channel to Xend.  This control channel itself uses the notification
+and shared memory mechanisms described here, but is set up for each
+domain automatically at startup.
+
+The backend, which runs in a privileged domain, then takes the page
+address and maps it into its own address space (in
+linux26/drivers/xen/blkback/interface.c:blkif_connect()):
+
+
+void blkif_connect(blkif_be_connect_t *connect)
+{
+   ...
+   unsigned long shmem_frame = connect->shmem_frame;
+   ...
+
+   if ( (vma = get_vm_area(PAGE_SIZE, VM_IOREMAP)) == NULL )
+   {
+      connect->status = BLKIF_BE_STATUS_OUT_OF_MEMORY;
+      return;
+   }
+
+   prot = __pgprot(_PAGE_PRESENT | _PAGE_RW | _PAGE_DIRTY | _PAGE_ACCESSED);
+   error = direct_remap_area_pages(&init_mm, VMALLOC_VMADDR(vma->addr),
+                                   shmem_frame<<PAGE_SHIFT, PAGE_SIZE,
+                                   prot, domid);
+
+   ...
+
+   blkif->blk_ring_base = (blkif_ring_t *)vma->addr;
+}
+
+The machine address of the page is passed in the shmem_frame field of
+the connect message.  This is then mapped into the virtual address
+space of the backend domain, and saved in the blkif structure
+representing this particular backend connection.
+
+NOTE:  New mechanisms will be added very shortly to allow domains to
+explicitly grant access to their pages to other domains.  This "grant
+table" support is in the process of being added to the tree, and will
+change the way a shared page is set up.  In particular, it will remove
+the need for the remapping domain to be privileged.
+
+Sending data across shared rings:
+
+Shared rings avoid the potential for write interference between
+domains in a very cunning way.  A ring is partitioned into a request
+and a response region, and domains only work within their own space.
+This can be thought of as a double producer-consumer ring -- the ring
+is described by four pointers into a circular buffer of fixed-size
+records.  Pointers may only advance, and may not pass one another.
+
+
+                          rsp_cons----+
+                                      V
+           +----+----+----+----+----+----+----+
+           |    |    |    free      |RSP1|RSP2|
+           +----+----+----+----+----+----+----+
+ req_prod->|    |       -------->        |RSP3|
+           +----+                        +----+
+           |REQ8|                        |    |<-rsp_prod
+           +----+                        +----+
+           |REQ7|                        |    |
+           +----+                        +----+
+           |REQ6|       <--------        |    |
+           +----+----+----+----+----+----+----+
+           |REQ5|REQ4|    free      |    |    |
+           +----+----+----+----+----+----+----+
+  req_cons---------^
+
+
+
+By adopting the convention that every request will receive a response,
+not all four pointers need be shared and flow control on the ring
+becomes very easy to manage.  Each domain manages its own consumer
+pointer, and the two producer pointers are visible to both
+(Xen/include/hypervisor-ifs/io/blkif.h):
+
+
+/* NB. Ring size must be small enough for sizeof(blkif_ring_t) <= PAGE_SIZE. */
+#define BLKIF_RING_SIZE        64
+
+  ...
+
+/*
+ * We use a special capitalised type name because it is _essential_ that all
+ * arithmetic on indexes is done on an integer type of the correct size.
+ */
+typedef u32 BLKIF_RING_IDX;
+
+/*
+ * Ring indexes are 'free running'. That is, they are not stored modulo the
+ * size of the ring buffer. The following macro converts a free-running counter
+ * into a value that can directly index a ring-buffer array.
+ */
+#define MASK_BLKIF_IDX(_i) ((_i)&(BLKIF_RING_SIZE-1))
+
+typedef struct {
+    BLKIF_RING_IDX req_prod;  /*  0: Request producer. Updated by front-end. */
+    BLKIF_RING_IDX resp_prod; /*  4: Response producer. Updated by back-end. */
+    union {                   /*  8 */
+        blkif_request_t  req;
+        blkif_response_t resp;
+    } PACKED ring[BLKIF_RING_SIZE];
+} PACKED blkif_ring_t;
+
+
+
+As shown in the diagram above, the rules for using a shared memory
+ring are simple.  
+
+1. A ring is full when a domain's producer is a whole ring ahead of its
+   consumer (e.g. req_prod - resp_cons == BLKIF_RING_SIZE), so that both
+   indexes map to the same slot.  In this situation, the consumer
+   pointer must be advanced.  Furthermore, if the consumer pointer is
+   equal to the other domain's producer pointer (e.g. resp_cons ==
+   resp_prod), then the other domain has all the buffers.
+
+2. Producer pointers point to the next buffer that will be written to.
+   (So blk_ring->ring[MASK_BLKIF_IDX(req_prod)] should not be consumed.)
+
+3. Consumer pointers point to a valid message, so long as they are not
+   equal to the associated producer pointer.
+
+4. A domain should only ever write to the message pointed
+   to by its producer index, and read from the message at its
+   consumer.  More generally, the domain may be thought of as having
+   exclusive access to the messages between its consumer and producer,
+   and should absolutely not read or write outside this region.
+
+In general, drivers keep a private copy of their producer pointer and
+then set the shared version when they are ready for the other end to
+process a set of messages.  Additionally, it is worth paying attention
+to the use of memory barriers (rmb/wmb) in the code, to ensure that
+rings that are shared across processors behave as expected.
+
+==== Structure of the Blkif Drivers ====
+
+Now that the communications primitives have been discussed, I'll
+quickly cover the general structure of the blkif driver.  This is
+intended to give a high-level idea of what is going on, in an effort
+to make reading the code a more approachable task.
+
+There are three key software components that are involved in the blkif
+drivers (not counting Xen itself): the frontend and backend drivers,
+and Xend, which coordinates their initial connection.  Xend may also
+be involved in control-channel signalling in some cases after startup,
+for instance to manage reconnection if the backend is restarted.
+
+===== Frontend Driver Structure =====
+
+The frontend domain uses a single event channel and a shared memory
+ring to trade control messages with the backend.  These are both set up
+during domain startup, which will be discussed shortly.  The shared
+memory ring is called blk_ring, and the private ring indexes are
+resp_cons and req_prod.  The ring is protected by blkif_io_lock.
+Additionally, the frontend keeps a list of outstanding requests in
+rec_ring[].  These are uniquely identified by a guest-local id number,
+which is associated with each request sent to the backend, and
+returned with the matching responses.  Information about the actual
+disks is stored in major_info[], of which only the first nr_vbds
+entries are valid.  Finally, the global 'recovery' indicates that the
+connection between the backend and frontend drivers has been broken
+(possibly due to a backend driver crash) and that the frontend is in
+recovery mode, in which case it will attempt to reconnect and reissue
+outstanding requests.
+
+The frontend driver is single-threaded and after setup is entered only
+through three points:  (1) read/write requests from the XenLinux guest
+that it is a part of, (2) interrupts from the backend driver on its
+event channel (blkif_int()), and (3) control messages from Xend
+(blkif_ctrlif_rx()).
+
+===== Backend Driver Structure =====
+
+The backend driver is slightly more complex as it must manage any
+number of concurrent frontend connections.  For each domain it
+manages, the backend driver maintains a blkif structure, which
+describes all the connection and disk information associated with that
+particular domain.  This structure is associated with the interrupt
+registration, and allows the backend driver to have immediate context
+when it takes a notification from some domain.
+
+All of the blkif structures are stored in a hash table (blkif_hash),
+which is indexed by a hash of the domain id and a "handle" (really a
+per-domain blkif identifier, in case a domain wants multiple connections).
+
+The per-connection blkif structure is of type blkif_t.  It contains
+all of the communication details (event channel, irq, shared memory
+ring and indexes), and blk_ring_lock, which is the backend mutex on
+the shared ring.  The structure also contains vbd_rb, which is a
+red-black tree, containing an entry for each device/partition that is
+assigned to that domain.  This structure is filled by xend passing
+disk information to the backend at startup, and is protected by
+vbd_lock.  Finally, the blkif struct contains a status field, which
+describes the state of the connection.
+
+The backend driver spawns a kernel thread at startup
+(blkio_schedule()), which handles requests to and from the actual disk
+device drivers.  This scheduler thread maintains a list of blkif
+structures that have pending requests, and services them round-robin
+with a maximum per-round request limit.  blkifs are added to the list
+in the interrupt handler (blkif_be_int()) using
+add_to_blkdev_list_tail(), and removed in the scheduler loop after
+calling do_block_io_op(), which processes a batch of requests.  The
+scheduler thread is explicitly activated at several points in the code
+using maybe_trigger_blkio_schedule().
+
+Pending requests between the backend driver and the physical device
+drivers use another ring, pending_ring.  Requests are placed in this
+ring in the scheduler thread and issued to the device.  A completion
+callback, end_block_io_op, indicates that requests have been serviced
+and generates a response on the appropriate blkif ring.
+pending_reqs[] stores a list of outstanding requests with the physical drivers.
+
+So, control entries to the backend are (1) the blkio scheduler thread,
+which sends requests to the real device drivers, (2) end_block_io_op,
+which is called as serviced requests complete, (3) blkif_be_int(),
+which handles notifications from the frontend drivers in other domains,
+and (4) blkif_ctrlif_rx(), which handles control messages from xend.
+
+==== Driver Startup ====
+
+Prior to starting a new guest using the frontend driver, the backend
+will have been started in a privileged domain.  The backend
+initialisation code initialises all of its data structures, such as
+the blkif hash table, and starts the scheduler thread as a kernel
+thread. It then sends a driver status up message to let xend know it
+is ready to take frontend connections.
+
+When a new domain that uses the blkif frontend driver is started,
+there is a series of interactions between it, xend, and the specified
+backend driver.  These interactions are as follows:
+
+The domain configuration given to xend will specify the backend domain
+and disks that the new guest is to use.  Prior to actually running the
+domain, xend and the backend driver interact to set up the initial
+blkif record in the backend.
+
+(1) Xend sends a BLKIF_BE_CREATE message to the backend.
+
+  Backend does blkif_create(), having been passed FE domid and handle.
+  It creates and initialises a new blkif struct, and puts it in the
+  hash table.
+  It then returns a STATUS_OK response to xend.
+
+(2) Xend sends a BLKIF_BE_VBD_CREATE message to the backend.
+ 
+  Backend adds a vbd entry in the red-black tree for the
+  specified (dom, handle) blkif entry.
+  Sends a STATUS_OK response.
+
+(3) Xend sends a BLKIF_BE_VBD_GROW message to the backend.
+
+  Backend takes the physical device information passed in the
+  message and assigns it to the newly created vbd struct.
+
+(2) and (3) repeat as any additional devices are added to the domain.
+
+At this point, the backend has enough state to allow the frontend
+domain to start.  The domain is run, and eventually gets to the
+frontend driver initialisation code.  After setting up the frontend
+data structures, this code continues the communications with xend and
+the backend to negotiate a connection:
+
+(4) Frontend sends Xend a BLKIF_FE_DRIVER_STATUS_CHANGED message.
+
+  This message tells xend that the driver is up.  The init function
+  now spin-waits until driver setup is complete in order to prevent
+  Linux from attempting to boot before the disks are connected.
+
+(5) Xend sends the frontend an INTERFACE_STATUS_CHANGED message
+
+  This message specifies that the interface is now disconnected
+  (instead of closed).
+  The domain updates its state, and allocates the shared blk_ring
+  page.
+
+(6) Frontend sends Xend a BLKIF_INTERFACE_CONNECT message
+
+  This message specifies the domain and handle, and includes the
+  address of the newly created page.
+
+(7) Xend sends the backend a BLKIF_BE_CONNECT message
+
+  The backend fills in the blkif connection information, maps the
+  shared page, and binds an irq to the event channel.
+  
+(8) Xend sends the frontend an INTERFACE_STATUS_CHANGED message
+
+  This message takes the frontend driver to a CONNECTED state, at
+  which point it binds an irq to the event channel and calls
+  xlvbd_init to initialise the individual block devices.
+
+The frontend Linux is still spin-waiting at this point, until all of
+the disks have been probed.  Messaging now is directly between the
+front and backend domain using the new shared ring and event channel.
+
+(9) The frontend sends a BLKIF_OP_PROBE directly to the backend.
+
+  This message includes a reference to an additional page that the
+  backend can use for its reply.  The backend responds with an array
+  of the domain's disks (as vdisk_t structs) on the provided page.
+
+The frontend now initialises each disk, calling xlvbd_init_device()
+for each one.
-- 
2.30.2